Course: Applied Geo-Data Science at the University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Advisers: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
Do you have questions about the workflow? Contact the author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 7 (Chapter 10)
In this exercise, we explore the role of structure in the data for model generalisability and how to best estimate a "true" out-of-sample error that corresponds to the prediction task. The task is to train a model on ecosystem flux observations from one site and predict them for another site (spatial upscaling). Using this realistic problem, the following concepts are practiced:
Concept of cross-validation
Spatial upscaling
The theory is very similar to re_ml_01; we only change our method to cross-validation. Validation is not the same as testing: validation is applied during model training, whereas the test set is held back until the very end.
Cross-validation (CV) is a resampling method that randomly divides the training data into n folds of approximately equal size. We fit the model on n-1 folds and use the remaining fold to compute the model performance. We repeat this n times, each time using a different fold as the validation set. After finishing this procedure, we have n estimates of the generalization error. The n-fold CV estimate is then computed by averaging the n validation errors, providing us with an approximation of the error we can expect on unseen data. There is no formal rule for the size of n. However, as n gets larger, the difference between the estimated performance and the true performance on the test set decreases, at the cost of more computation time.
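To make the procedure concrete, the following minimal sketch shows the idea of n-fold CV with the "rsample" package on a built-in data set (a simple linear model stands in for KNN here; our own wrapper functions do this internally):

```r
library(rsample)

set.seed(123)
folds <- rsample::vfold_cv(mtcars, v = 10)  # 10 folds of roughly equal size

# For each split: fit on 9 folds (the "analysis" set), validate on the
# held-out fold (the "assessment" set), then average the 10 errors.
mae_per_fold <- sapply(folds$splits, function(split) {
  fit  <- lm(mpg ~ wt + hp, data = rsample::analysis(split))
  hold <- rsample::assessment(split)
  mean(abs(hold$mpg - predict(fit, hold)))
})
mean(mae_per_fold)  # the 10-fold CV estimate of the MAE
```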
For KNN, k is the only hyper-parameter. In exercise re_ml_01, we demonstrated how to find the optimal k. If we train our model with CV, we obtain a model with a better bias-variance trade-off than before. With an optimal k and cross-validation, we get a "true" out-of-sample estimate of the prediction error. That is why we implement a CV-KNN algorithm.
We split our data and use 80 % for training and 20 % for model evaluation, with 10 folds for cross-validation. Because the KNN algorithm is distance-based, we have to standardize our data. With the Box-Cox transformation, we pull the predictors towards a normal distribution. We use set.seed() for reproducibility.
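As a preview, a minimal sketch of such a preprocessing recipe with the "recipes" package might look as follows (it assumes the same three predictors as in re_ml_01 and the Davos training set created further below; the actual recipe is built inside our model functions):

```r
library(recipes)

pp <- recipes::recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
                      data = daily_fluxes_davos_train) |>
  recipes::step_BoxCox(recipes::all_predictors()) |>  # pull predictors towards normality
  recipes::step_center(recipes::all_predictors()) |>  # mean of 0 ...
  recipes::step_scale(recipes::all_predictors())      # ... and standard deviation of 1
```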
The open-source IDE RStudio (version 2022.12.0+353) was used for all analyses. All data handling is done in the R language. We use the R package "tidyverse" and other packages to process the data efficiently.
The following code chunk loads all packages we need. The package "conflicted" is important: it lets us choose which function to use when functions from different packages share the same name but do not do the same thing. In such cases, we set explicit preferences.
source("../../R/general/packages.R")
First, we download the data and save it in our repository, but we must do this only once. If the file already exists, we read it directly; that is why we implement an if-else statement. We know this could be done more simply, but we want to demonstrate that there are several ways to read the file.
name.of.file <- "../../data/re_ml_02/daily_fluxes.davos.csv"
# If do not exists such a file, create it!
if (!file.exists(name.of.file)){
# Access to the data
url.1 <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/FLX_C
-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv"
# Read in the data directly from URL
daily_fluxes.davos <- read.table(url.1, header = TRUE, sep = ",")
# Write a CSV file in the respiratory
write_csv(daily_fluxes.davos, "../../data/re_ml_02/daily_fluxes.davos.csv")
# Read the file
daily_fluxes.davos <- read_csv("../../data/re_ml_02/daily_fluxes.davos.csv")
# If exists such a file, read it only!
}else{daily_fluxes.davos <- read_csv("../../data/re_ml_02/daily_fluxes.davos.csv")}
name.of.file <- "../../data/re_ml_02/daily_fluxes.laegern.csv"
# If do not exists such a file, create it!
if (!file.exists(name.of.file)){
# Access to the data
url.2 <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/FLX_CH
-Lae_FLUXNET2015_FULLSET_DD_2004-2014_1-4.csv"
# Read in the data directly from URL
daily_fluxes.davos <- read.table(url.1, header = TRUE, sep = ",")
# Write a CSV file in the respiratory
write_csv(daily_fluxes.laegern, "../../data/re_ml_02/daily_fluxes.laegern.csv")
# Read the file
daily_fluxes.laegern <- read_csv("../../data/re_ml_02/daily_fluxes.laegern.csv")
# If exists such a file, read it!
}else{daily_fluxes.laegern <- read_csv("../../data/re_ml_02/daily_fluxes.laegern.csv")}
We can see that some columns contain -9999 as a value; our quality function changes that to NA. Then we use ymd() from the "lubridate" package to rewrite the date in a proper format. Further, we only want values of good quality. For that, we check selected columns against their quality-control columns: if the proportion of good measurements is less than 80 %, we overwrite the value with NA (and do not drop the row!). Now we know that our data is of high quality, and we can perform our analyses with it.
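Conceptually, the cleaning steps inside our helper function might look like the following sketch (column names follow the FLUXNET convention; the actual code is in function.use.good.quality.only.R):

```r
library(dplyr)
library(lubridate)

clean_fluxes <- function(df){
  df |>
    # -9999 is the FLUXNET code for missing values --> convert to NA
    dplyr::mutate(dplyr::across(where(is.numeric), \(x) dplyr::na_if(x, -9999))) |>
    # Rewrite the date (a YYYYMMDD integer) as a proper Date object
    dplyr::mutate(TIMESTAMP = lubridate::ymd(TIMESTAMP)) |>
    # Keep GPP only where at least 80 % of the underlying half-hourly
    # values are measured or good-quality gap-filled
    dplyr::mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.8,
                                          NA, GPP_NT_VUT_REF))
}
```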
Here, we work with two locations (Davos and Lägern), but we do not want to do everything twice. That is why we define a new data frame that contains the data of both locations. We use .id = "id" because we must know which site the values come from. Since a numeric ID is not very readable, we rename the column (id → Location) and replace the values (1 → Davos, 2 → Lägern). This way, we can use the filter() function from the package "dplyr" to analyze our data efficiently.
We can see that there are many missing values in incoming long-wave radiation (LW_IN_F) in Davos and in precipitation (P_F) in Lägern. We would lose almost half of our data in the recipe call (model training). That is why we do not use these columns. After we discard them, we run the quality check again and see that our data is fine. We bind the rows and run the quality check one last time. Now we are sure that everything is in order.
# Load the function in the file (we use the function from another markdown again)
source("../../R/re_ml_01/function.use.good.quality.only.R")
source("../../R/general/function.visualize.na.values.R")
# Function call to clean the data. The quality is poor, so we do not use these versions!
daily_fluxes.dav.no <- use.good.quality.only(daily_fluxes.davos)
daily_fluxes.lag.no <- use.good.quality.only(daily_fluxes.laegern)
# Visualize the result ---> many NA! We would lose too much data.
visualize.na.values.without.groups(daily_fluxes.dav.no, c(" Davos: Problem with LW_IN_F!"))
visualize.na.values.without.groups(daily_fluxes.lag.no, c(" Lägern: Problem with P_F!"))
# Function call to clean the data (We use another function to clean our data)
daily_fluxes.dav <- use.good.quality(daily_fluxes.davos)
daily_fluxes.lag <- use.good.quality(daily_fluxes.laegern)
# Visualize the results --> Now we can work!
visualize.na.values.without.groups(daily_fluxes.dav, c("Davos: Everything ok!"))
visualize.na.values.without.groups(daily_fluxes.lag, c("Lägern: Everything ok!"))
# We create a new data frame with bind_rows. We use "id" as an identifier.
daily_fluxes_both <- bind_rows(daily_fluxes.dav, daily_fluxes.lag, .id = "id")
# We change the ID in our location names
daily_fluxes_both <- daily_fluxes_both|>
rename("Location" = "id")|>
mutate(Location = ifelse(Location == 1, "Davos", "Lägern"))
# Function call: we make a quality control one last time --> everything is ok!
visualize.na.values.without.groups(daily_fluxes_both, c("Pooled: Everything ok!"))
Here, we split our data into a training set (80 %) and a test set (20 %). We set the seed for the pseudo-random sampling (reproducibility).
# For reproducibility (pseudo-random)
set.seed(123)
# Split 80 % to 20 %
split_davos <- rsample::initial_split(daily_fluxes_both|>
dplyr::filter(Location == "Davos"),
prop = 0.8, strata = "VPD_F")
split_laegern <- rsample::initial_split(daily_fluxes_both|>
dplyr::filter(Location == "Lägern"),
prop = 0.8, strata = "VPD_F")
split_both <- rsample::initial_split(daily_fluxes_both, prop = 0.8, strata = "VPD_F")
# Split Davos
daily_fluxes_davos_train <- rsample::training(split_davos)
daily_fluxes_davos_test <- rsample::testing(split_davos)
# Split Lägern
daily_fluxes_laegern_train <- rsample::training(split_laegern)
daily_fluxes_laegern_test <- rsample::testing(split_laegern)
# Split pooled
daily_fluxes_both_train <- rsample::training(split_both)
daily_fluxes_both_test <- rsample::testing(split_both)
We use a sequence of candidate values to determine the optimal k. The model with the optimal k has the smallest mean absolute error (MAE), but we are also interested in other metrics, such as RSQ, which our function reports as well. We create three models: one trained on Davos, one on Lägern, and one on the pooled data.
We can see that the optimal k is 50. This seems odd because the data are the same as in exercise re_ml_01, where the optimal k was 15. But we have to consider that we use a different approach to create our model, namely cross-validation, which is why we find a different optimal k for the same data (and the same set.seed). Further, in this exercise we use 80 % instead of 70 % of the data set for training.
We also used other sequences to narrow down the optimal k (with the call below we only know that the optimal k lies in a certain interval); only the results are presented here. We set eval = FALSE because the computation time is high. If you want to check, please run the code manually (it will work).
source("../../R/re_ml_02/function.parameter.extracter.cv.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))
# Visualize the MAE and RSQ --> optimal k is 50; note that 10 is the number of folds
parameter.extracter.knn.cv(my.sequence,
daily_fluxes_davos_train,
daily_fluxes_davos_test, 10)
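Our helper function wraps this grid search. As a hedged sketch of the underlying idea with "caret" (assuming the same three predictors as in re_ml_01; our actual wrapper may differ in its details):

```r
library(caret)

set.seed(123)
# 10-fold cross-validation; caret averages MAE and RSQ over the folds
ctrl <- caret::trainControl(method = "cv", number = 10)

# Try every k in the sequence and keep the cross-validated metrics
knn_grid <- caret::train(
  GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
  data       = daily_fluxes_davos_train |>
    dplyr::select(GPP_NT_VUT_REF, SW_IN_F, VPD_F, TA_F) |>
    tidyr::drop_na(),
  method     = "knn",
  preProcess = c("BoxCox", "center", "scale"),
  trControl  = ctrl,
  tuneGrid   = data.frame(k = my.sequence),
  metric     = "MAE"
)
knn_grid$results  # MAE, RMSE and Rsquared for every candidate k
```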
We do the same for Lägern; here the optimal k is 23. We set eval = FALSE because the computation time is high. If you want to check, please run the code manually (it will work).
source("../../R/re_ml_02/function.parameter.extracter.cv.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5), 23)
# Visualize the MAE and RSQ --> optimal k is 23 (between 20 and 30)
parameter.extracter.knn.cv(my.sequence,
daily_fluxes_laegern_train,
daily_fluxes_laegern_test, 10)
We do the same for the pooled data and find an optimal k of 49. Please be patient; the computation time is high.
source("../../R/re_ml_02/function.parameter.extracter.cv.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5),49)
# Visualize the MAE and RSQ --> optimal k is 49
parameter.extracter.knn.cv(my.sequence,
daily_fluxes_both_train,
daily_fluxes_both_test, 10)
Now that we know the optimal k for each model, we refit the models with these values.
source("../../R/re_ml_02/function.knn.cv.model.R")
# Function call for Davos, 50 = k, 10 = cv
knn.model.davos.optimal <- knn.cv.model(daily_fluxes_davos_train, 50, 10)
# Function call for Lägern, 23 = k, 10 = cv
knn.model.laegern.optimal <- knn.cv.model(daily_fluxes_laegern_train, 23, 10)
# Function call for Davos and Lägern, 49 = k, 10 = cv
knn.model.both.optimal <- knn.cv.model(daily_fluxes_both_train, 49, 10)
We evaluate our models. For that, we combine each trained model with every test set. With three models and three test sets, we need nine evaluations.
# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")
# Model Davos
P_1 <- eval_model(knn.model.davos.optimal, daily_fluxes_davos_train,
daily_fluxes_davos_test,
c("Davos (opt. k = 50)"), c("Davos"))
P_2 <- eval_model(knn.model.davos.optimal, daily_fluxes_davos_train,
daily_fluxes_laegern_test,
c("Davos (opt. k = 50)"), c("Lägern"))
P_3 <- eval_model(knn.model.davos.optimal, daily_fluxes_davos_train,
daily_fluxes_both_test,
c("Davos (opt. k = 50)"), c("Davos and Lägern"))
# Model Lägern
P_4 <- eval_model(knn.model.laegern.optimal, daily_fluxes_laegern_train,
daily_fluxes_davos_test,
c("Lägern (opt. k = 23)"), c("Davos"))
P_5 <- eval_model(knn.model.laegern.optimal, daily_fluxes_laegern_train,
daily_fluxes_laegern_test,
c("Lägern (opt. k = 23)"), c("Lägern"))
P_6 <- eval_model(knn.model.laegern.optimal, daily_fluxes_laegern_train,
daily_fluxes_both_test,
c("Lägern (opt. k = 23)"), c("Davos+Lägern"))
# Model Davos und Lägern
P_7 <- eval_model(knn.model.both.optimal, daily_fluxes_both_train,
daily_fluxes_davos_test, c("Davos and Lägern (opt. k = 49)"),
c("Davos"))
P_8 <- eval_model(knn.model.both.optimal, daily_fluxes_both_train,
daily_fluxes_laegern_test, c("Davos and Lägern (opt. k = 49)"),
c("Lägern"))
P_9 <- eval_model(knn.model.both.optimal, daily_fluxes_both_train,
daily_fluxes_both_test, c("Davos and Lägern (opt. k = 49)"),
c("Davos+Lägern"))
# Davos: within-site, across-site and both locations
cowplot::plot_grid(P_1, P_2, P_3, ncol = 1)
# Lägern: within-site, across-site and both locations
cowplot::plot_grid(P_4, P_5, P_6, ncol = 1)
# Pooled: within-site, across-site and both locations
cowplot::plot_grid(P_7, P_8, P_9, ncol = 1)
We want more information about the residuals. We use a flexible function to get this information and plot the variation over time, which lets us check for patterns. We use the pooled KNN model to predict both locations because we are interested in a well-generalized model. We visualize the whole data set (all observations and all test data) and can see that there is an annual cycle at both locations.
# Load the function into the file
source("../../R/re_ml_02/function.plot.function.R")
# We plot the actually measured GPP in the pooled data set.
plot.time.variation(daily_fluxes_both,
daily_fluxes_both$TIMESTAMP,
daily_fluxes_both$GPP_NT_VUT_REF,c("Overview about the pooled measurements"),
sub.lab = c("optimal k = 49"),prediction = FALSE)
# We plot the test data set from the pooled data set. We did not do any prediction
plot.time.variation(daily_fluxes_both_test,
daily_fluxes_both_test$TIMESTAMP,
daily_fluxes_both_test$GPP_NT_VUT_REF, c("Overview about the pooled test data"),
prediction = FALSE, sub.lab = c("optimal k = 49"))
# Load the function into the file
source("../../R/re_ml_02/function.plot.function.R")
# We plot the predicted test data set.
plot.time.variation(daily_fluxes_both_test,
daily_fluxes_both_test$TIMESTAMP,
daily_fluxes_both_test$GPP_NT_VUT_REF,c("Overview about the predicted pooled test data"),
prediction = TRUE, mod1 = knn.model.both.optimal, sub.lab = c("optimal k = 49"))
First, we take a look at Davos. We can see that the model underestimates the GPP values overall. More important, however, is the annual cycle: in summer the model overestimates GPP, and in winter it underestimates GPP. We want to know where exactly and why. To answer these questions, we examine the residuals at a higher temporal resolution and plot them as a function of the day of the year. Now we can confirm that our model has a negative bias in the winter season and a positive bias in summer.
# Load the function
source("../../R/re_ml_02/function.vis.residue.R")
# Residuals for Davos: train: pooled, test: Davos (annual)
time.variation.year(knn.model.both.optimal, daily_fluxes_davos_test,
c("Train: Pooled, Test: Davos, k = 49"),
c("re_ml_2 (Chapter 10)"))
# We want a better resolution of the residuals for Davos (daily)
time.variation.year(knn.model.both.optimal, daily_fluxes_davos_test,
c("Train: Pooled, Test: Davos, k = 49"),
c("re_ml_2 (Chapter 10)"), annual = FALSE)
We do the same for Lägern. We can again see an annual cycle: the model overestimates the GPP values during summer and underestimates them during winter. To examine this, we need a higher resolution, so we plot the residuals as a function of the day of the year.
# Load the function
source("../../R/re_ml_02/function.vis.residue.R")
# Residuals for Lägern: train: pooled, test: Lägern (annual)
time.variation.year(knn.model.both.optimal, daily_fluxes_laegern_test,
c("Train: Pooled, Test: Lägern, k = 49"),
c("re_ml_2 (Chapter 10)"))
# We want a better resolution of the residuals for Lägern (daily)
time.variation.year(knn.model.both.optimal, daily_fluxes_laegern_test,
c("Train: Pooled, Test: Lägern, k = 49"),
c("re_ml_2 (Chapter 10)"), annual = FALSE)
We have merged two data sets, but the two locations are very different. Table 1 gives an overview:
Table 1: Comparison of the measuring stations (Davos vs. Lägern)
| Quantity | Davos | Lägern |
|---|---|---|
| Coordinates (2) | 46.81° N, 9.84° E | 47.48° N, 8.4° E |
| Elevation (2) | 1639 m a.s.l. | 689 m a.s.l. |
| Exposure (2) | south-east slope | summit |
| 2 m temperature (3) | 2.8 °C | 8.3 °C |
| Precipitation (3) | 1062 mm per year | 1100 mm per year |
| Climate (3) [Köppen] | Tundra | - |
| Hours of sunshine (2) | 1537 h - 2076 h | 1462 h - 2152 h |
| Wind (2) | 8 km/h | 17 km/h |
| Vegetation (3) | ENF (Evergreen Needleleaf Forests: lands dominated by woody vegetation with a percent cover > 60 % and a height exceeding 2 m; almost all trees remain green all year, and the canopy is never without green foliage) | MF (Mixed Forests: lands dominated by trees with a percent cover > 60 % and a height exceeding 2 m; tree communities with interspersed mixtures or mosaics of the other four forest types, none of which exceeds 60 % of the landscape) |
We tried different approaches to modeling GPP, taking several metrics and the hyper-parameter k into account (e.g., RSQ, MAE). Our goal was to create a generalizable model, i.e., one that can predict GPP at different locations (Davos or Lägern). After identifying the presumably best model, we applied it to predict the test data. The mean difference between the predicted and the observed test data is called the bias. With this parameter, we can discuss our analysis.
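Formally, with predictions \(\hat{y}_i\) and observations \(y_i\) of the test set, the bias is the mean error:

\[ \text{bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) \]

A positive bias means the model overestimates GPP on average; a negative bias means it underestimates it.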
We can see that, with cross-validation, the model performs best on test data from the same location it was trained on. If we predict another location, the model performs worst; if we predict the pooled data, the performance lies in between. That is not surprising. Let us take a closer look:
Davos:
If we train a model with data from Davos and optimize it (k = 50), RSQ is 0.73 (which is actually very good) and RMSE is 1.43. If we predict the test data from Davos, RSQ is 0.73 and RMSE is 1.44. The bias is about -0.06, which means our model slightly underestimates GPP in general. If we predict Lägern, RSQ decreases to 0.52 and RMSE increases to 3.24; it follows that we cannot predict Lägern well with a Davos model. If we predict the pooled data, RSQ is 0.61 and RMSE is 2.24; the bias is about 0.42.
Lägern:
For Lägern, the situation is quite similar. If we train a model with Lägern data and optimize it (k = 23), RSQ is 0.66 and RMSE is 2.44. The bias is 0.01, which means the model very slightly overestimates GPP in general. If we predict Davos, RSQ decreases to 0.55 and RMSE is 2.33, while the bias grows to -1.07. If we predict the pooled data, RSQ decreases to 0.61 and RMSE is 2.32; the bias is about -0.68 and therefore larger in magnitude than for the Davos model (but with opposite sign!).
Pooled:
Our goal is to create a model that can predict different locations. To achieve this, we trained a model on the pooled data and optimized it (k = 49). If we predict the pooled test data, RSQ is 0.67, RMSE is 2, and the bias is almost zero. If we predict Davos, RSQ increases to 0.71, RMSE decreases to 1.53, and the bias is about -0.29. This is remarkable: it is almost the same performance as with the Davos model itself. If we predict Lägern, RSQ decreases to 0.64, RMSE increases to 2.62, and the bias is 0.59.
We also see that, for Davos, the model overestimates GPP by almost \(2\ \mu mol\ m^{-2}\ s^{-1}\) during the summer season and underestimates it by almost \(2\ \mu mol\ m^{-2}\ s^{-1}\) during the winter season; for the rest of the year, the mean residual is almost zero. For Lägern, the seasonal bias is larger, reaching almost \(5\ \mu mol\ m^{-2}\ s^{-1}\) in summer and \(-5\ \mu mol\ m^{-2}\ s^{-1}\) in winter. It seems that the model performs considerably better for Davos than for Lägern. The reason could be that we have much more data for Davos (6574 rows) than for Lägern (4018 rows).
Before we interpret the difference, we must know what we want to achieve. Are we only interested in a specific location, or are we interested in predicting GPP in general?
Let us assume we are interested in a single location. It is obvious that we get a better model if we only use data from the location of interest. But often we are not interested in a specific location, and even if we were, we would need something to compare with. Therefore, we need a generalized model that can predict different locations.
Of course, we are not yet satisfied with the model. We suggest extending the project to the whole of Switzerland. Then we would have dozens of measuring stations on different terrain, and we could classify the stations, for example according to their exposure, and create one model per class. For example, a "summit" model would be trained with data from stations located at summits. It could also make sense to divide the stations into regions or altitude bands. One has to make a trade-off between generalizability and accuracy: if we generalize, we have fewer models, but the error increases; if we do not generalize, we get higher accuracy, but we have to use a different model for each location.
A very important parameter is the bias. We can see that our pooled model underestimates GPP for Davos and overestimates it for Lägern. The reason for this is not entirely clear. In Davos, the model starts overestimating GPP around day 150 of the year (mid-May); in Lägern, this happens already around day 100 (mid-April). It could be the elevation (see Table 1): in April, Davos is still snow-covered and the growing season starts later. It could also be temperature, because low temperatures inhibit GPP.
Another explanation could be exposure (see Table 1). In Lägern, the station is located at a summit, which means more wind. Perhaps the wind affects the measurements (although we consider this unlikely, as it would be very unprofessional). It also matters whether the sun shines directly on the surface or whether it is always in shadow (a different amount of energy per square meter).
In summary, we should not pool data indiscriminately; it is important to compare like with like. We suggest classifying the measurement stations, because topography, meteorological parameters, and elevation influence the model.
(1) Cross-validation: https://bradleyboehmke.github.io/HOML/process.html#k-fold-cross-validation
(2) Meteorological data
(3) FLUXNET